Complete Java Execution Pipeline
The journey from a .java source file to CPU instructions is a multi-stage pipeline involving lexical analysis, bytecode generation, class loading, bytecode verification, linking, JIT compilation, machine code generation, and finally CPU execution. Understanding each stage is fundamental to diagnosing performance problems, memory issues, and subtle concurrency bugs in production systems.
Full Pipeline Overview
Stage 1 — javac Compilation
The javac compiler performs: lexical analysis (tokenization), syntax analysis (parse tree), semantic analysis (type checking, name resolution), constant folding, and bytecode generation. It outputs a .class file containing the JVM bytecode — a platform-independent intermediate representation.
The .class file contains: a magic number (0xCAFEBABE), major/minor version, constant pool, access flags, class/interface hierarchy, field descriptors, method descriptors, bytecode for each method, and attribute tables (SourceFile, LineNumberTable, LocalVariableTable, etc.).
```java
// Java source
public class Add {
    public static int add(int a, int b) {
        return a + b;
    }
}
```

```
// Compiled bytecode (javap -c Add.class)
public static int add(int, int);
  Code:
     0: iload_0    // push a onto operand stack
     1: iload_1    // push b onto operand stack
     2: iadd       // pop 2, add, push result
     3: ireturn    // return int on top of stack
```
Stage 2 — Class Loading
The ClassLoader subsystem reads .class bytes into the JVM's Method Area (Metaspace). The three built-in loaders form a strict delegation hierarchy. Loading is lazy by default — a class is not loaded until it is first actively used.
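Lazy loading is easy to observe directly. The sketch below (class names are illustrative, not JDK API) tracks whether a class's `<clinit>` has run: merely declaring a variable of the type does nothing, while the first read of a static field pulls the class through loading, linking, and initialization.

```java
// Records whether Lazy's <clinit> has run, without relying on stdout ordering.
class InitTracker {
    static boolean lazyInitialized = false;
}

class Lazy {
    static { InitTracker.lazyInitialized = true; }
    static int value = 42;
}

public class LazyLoadDemo {
    public static void main(String[] args) {
        Lazy unused = null;                          // passive: a type reference alone does not initialize
        assert !InitTracker.lazyInitialized;
        int v = Lazy.value;                          // first active use → loading + linking + <clinit>
        assert InitTracker.lazyInitialized && v == 42;
    }
}
```

Run with `java -ea LazyLoadDemo` — plain `assert` statements are disabled by default.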
Stage 3 — Bytecode Verification
The bytecode verifier enforces JVM type safety. It checks: that the stack never overflows or underflows, that local variable types are consistent, that field and method references are valid, and that final methods are not overridden. This is a critical security boundary — it ensures that even malicious bytecode cannot subvert JVM memory safety.
Stage 4 — Linking (Prepare + Resolve)
Preparation allocates memory for static fields and sets them to default values (0, null, false). Resolution replaces symbolic references in the constant pool with direct references (memory pointers) to classes, fields, and methods. Resolution can be eager or lazy depending on JVM flags.
Stage 5 — Initialization
The class's <clinit> method is invoked — executing static field assignments and static initializer blocks in textual order. Initialization is synchronized: the JVM guarantees that only one thread initializes a class, and subsequent threads see the initialized state without synchronization (via the class loading lock).
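This per-class initialization lock is exactly what makes the initialization-on-demand holder idiom work. A minimal sketch (class names illustrative): the nested holder class is not initialized when the outer class is, and the JVM's `<clinit>` guarantees make the lazy publication thread-safe without `volatile` or `synchronized`.

```java
public class Config {
    private Config() {}

    // Holder is a separate class, so it is NOT initialized when Config is.
    private static class Holder {
        static final Config INSTANCE = new Config();  // runs inside Holder's <clinit>
    }

    // First call triggers Holder's initialization under the JVM's per-class init lock;
    // all later calls see the fully constructed INSTANCE with zero synchronization cost.
    public static Config getInstance() {
        return Holder.INSTANCE;
    }
}
```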
Stage 6 — Interpreter Execution (Tier 0)
The template interpreter dispatches each bytecode instruction via a pre-generated native code fragment (a "template"). Each bytecode has its own assembly stub. The interpreter maintains a per-method invocation counter. When a method's counter exceeds CompileThreshold (default: 10,000), it is submitted to the JIT compiler queue.
Stage 7 — Tiered JIT Compilation
HotSpot uses tiered compilation: Tier 0 = interpreter, Tier 1 = C1 (no profiling), Tier 2 = C1 (limited profiling), Tier 3 = C1 (full profiling), Tier 4 = C2 (maximum optimization). Methods graduate through tiers based on invocation and back-edge (loop) counters.
Stage 8 — Machine Code Execution
JIT-compiled code is stored in the Code Cache (a JVM-managed native memory region). Execution jumps into this cache, bypassing the interpreter entirely. The JVM may deoptimize compiled code back to the interpreter if speculative optimizations (like inlining based on CHA) are invalidated by class loading events.
JVM Architecture — System Level
HotSpot is a complex C++ application (~4 million lines of code in OpenJDK). Understanding its subsystem boundaries and their interactions is essential for advanced JVM tuning and modification.
ClassLoader Subsystem
Responsible for finding, loading, verifying, and initializing class files. Interacts directly with the file system, JAR files, module descriptors, and the network (for remote class loading). The result of loading is stored in Metaspace as a Klass C++ object, paired with a heap-allocated java.lang.Class mirror object.
Runtime Data Areas
- Metaspace (native memory): class metadata, method bytecode, constant pools, vtables, itables.
- Heap (GC-managed): all Java objects and arrays.
- Thread Stacks: per-thread; contain a frame for each method call.
- PC Register: points to the current bytecode instruction (undefined for native methods).
- Native Method Stack: the C stack used when executing JNI native methods.
Execution Engine
The execution engine has three major components that cooperate continuously: the template interpreter (executes bytecode until hot), the JIT compiler (compiles hot methods to native code), and the garbage collector (reclaims dead objects while respecting safepoints and GC barriers emitted by both the interpreter and JIT).
Native Interface (JNI)
JNI bridges Java and native C/C++ code. The JVM pushes/pops JNI frames on the thread's native call stack. JNI handles require explicit lifecycle management — GC roots are pinned while in native frames to prevent object relocation by concurrent collectors.
HotSpot Source Code Structure
Navigating the OpenJDK HotSpot source tree is daunting initially. Understanding the layout enables targeted exploration and modification.
Key Source Files for JVM Internals
| File | Responsibility | Key Concepts |
|---|---|---|
| `oops/oop.hpp` | Base class for all Java objects in the JVM | Mark word, klass pointer, field layout |
| `oops/klass.hpp` | Class metadata representation | vtable, itable, layout helper |
| `oops/markWord.hpp` | Object header mark word layout | Lock state, hash code, GC age |
| `runtime/safepoint.cpp` | Safepoint polling and synchronization | Thread suspension, deoptimization |
| `interpreter/templateTable.cpp` | Per-bytecode native code templates | Bytecode dispatch, stack manipulation |
| `gc/g1/g1CollectedHeap.cpp` | G1 GC heap management | Region management, concurrent marking |
| `compiler/compileBroker.cpp` | JIT compilation task management | Compilation queue, tier transitions |
OOP Hierarchy
In HotSpot, "oop" means Ordinary Object Pointer — a pointer to a Java object on the heap. The C++ type hierarchy is:
```
oop                      // base pointer type
├─ instanceOop           // regular Java object instance
└─ arrayOop              // base for arrays
   ├─ objArrayOop        // array of object references
   └─ typeArrayOop       // array of primitives (int[], byte[], etc.)

Klass                    // class metadata (in Metaspace)
├─ InstanceKlass         // regular class
└─ ArrayKlass            // array class
   ├─ ObjArrayKlass      // reference array
   └─ TypeArrayKlass     // primitive array
```
Class Loading Internals
Class loading is a three-phase process: Loading (finding and reading the class file bytes), Linking (verification + preparation + resolution), and Initialization (executing <clinit>). Each phase has strict rules and ordering guarantees.
ClassLoader Hierarchy
Parent Delegation Model
When a ClassLoader receives a loadClass(name) request, it always delegates to its parent first. Only if the parent cannot find the class does the child attempt to load it. This ensures that core Java classes (e.g., java.lang.String) are always loaded by Bootstrap, preventing malicious replacement of system classes.
```java
// ClassLoader.loadClass() — simplified logic
protected synchronized Class<?> loadClass(String name, boolean resolve)
        throws ClassNotFoundException {
    Class<?> c = findLoadedClass(name);              // check cache
    if (c == null) {
        try {
            if (parent != null)
                c = parent.loadClass(name, false);   // delegate up
            else
                c = findBootstrapClass(name);        // bootstrap
        } catch (ClassNotFoundException e) {
            c = findClass(name);                     // child handles it
        }
    }
    if (resolve) resolveClass(c);
    return c;
}
```
Class Identity
In the JVM, a class is uniquely identified by the tuple (fully-qualified-name, ClassLoader-instance). Two classes with the same name loaded by different ClassLoaders are completely different types — instances of one cannot be cast to the other, even if they have identical bytecode. This is the source of the notorious ClassCastException in application servers.
A leaked ClassLoader keeps every class it defined alive, which is the classic cause of OutOfMemoryError: Metaspace in hot-deploy scenarios.
Custom ClassLoader Implementation
```java
public class BytecodeClassLoader extends ClassLoader {
    private final byte[] bytecode;

    public BytecodeClassLoader(byte[] bytecode) {
        super(ClassLoader.getSystemClassLoader());
        this.bytecode = bytecode;
    }

    @Override
    protected Class<?> findClass(String name) {
        // defineClass hands the bytes to the JVM, which links them (verify + prepare + resolve)
        return defineClass(name, bytecode, 0, bytecode.length);
    }
}
```
Dynamic Module Loading (JPMS)
In Java 9+, the module system layers module visibility rules on top of ClassLoaders. Code in one module can only access a class in another module if that module exports the containing package. The Bootstrap CL is now responsible for java.base, while the Platform CL covers the other JDK modules. Named modules provide stronger encapsulation guarantees than the old classpath model.
Linking — Deep Internals
| Phase | What Happens | Errors Thrown |
|---|---|---|
| Verification | Bytecode format checks, type safety validation, stack shape analysis, instruction reachability | VerifyError |
| Preparation | Static field memory allocated; set to zero-values. No code executed yet. | OOM for Metaspace |
| Resolution | Symbolic refs in constant pool replaced with direct refs (pointers). Lazy or eager per JVM. | NoClassDefFoundError, LinkageError |
Class Initialization Order — Exact Execution Rules
Class initialization ordering is one of the most-tested JVM topics in FAANG interviews. The JVM specification defines a precise ordering that must be understood at the bytecode level.
Complete Ordering Rules
Initialization Triggers (Active Uses)
A class is initialized only when it is actively used. Passive uses (e.g., accessing a static final compile-time constant, creating an array of the type) do not trigger initialization.
| Trigger | Active Use? | Notes |
|---|---|---|
| `new ClassName()` | ✅ Yes | Most common trigger |
| Call a static method | ✅ Yes | Inherited static methods initialize the declaring class |
| Access/assign a static field | ✅ Yes | Unless it's a static final compile-time constant |
| `Class.forName("Foo")` | ✅ Yes | By default; pass `initialize=false` to suppress |
| First active use of a subclass | ✅ Yes | Triggers superclass initialization first |
| Access `static final int X = 5` | ❌ No | Compile-time constant — inlined by javac |
| `new Foo[10]` | ❌ No | Array creation doesn't initialize the element type |
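The passive-use rows can be verified with a small experiment (class names illustrative). `COMPILE_TIME` is a constant expression, so javac inlines it at the use site and no `getstatic` on `Constants` is ever executed; `RUNTIME` is initialized by a method call, so reading it is a genuine active use.

```java
class Flags {
    static boolean constantsInitialized = false;
}

class Constants {
    static { Flags.constantsInitialized = true; }
    static final int COMPILE_TIME = 5;         // constant expression → inlined by javac
    static final int RUNTIME = "x".length();   // method call → NOT a compile-time constant
}

public class TriggerDemo {
    public static void main(String[] args) {
        int a = Constants.COMPILE_TIME;        // passive: javac inlined the literal 5
        Constants[] arr = new Constants[10];   // passive: loads the array class, no Constants init
        assert !Flags.constantsInitialized;    // <clinit> has NOT run
        int b = Constants.RUNTIME;             // active: a real getstatic
        assert Flags.constantsInitialized;     // now it has
    }
}
```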
Tricky Interview Case — Static Initialization Ordering
```java
class Parent {
    static int x = initX();                                          // Step A
    static { System.out.println("Parent static block, x=" + x); }    // Step B
    static int initX() { return 10; }
}

class Child extends Parent {
    static int y = 20;                                               // Step D (after Parent init)
    static { System.out.println("Child static block, y=" + y); }     // Step E
    int instanceY = 100;                                             // Step G (per instance)
    { System.out.println("Child IIB, instanceY=" + instanceY); }     // Step H
    Child() { System.out.println("Child constructor"); }             // Step I
}

// new Child() output:
// Parent static block, x=10    ← Parent <clinit> runs first
// Child static block, y=20     ← Child <clinit> runs second
// Child IIB, instanceY=100     ← instance initializer before constructor body
// Child constructor            ← constructor body last
```
The Forward Reference Trap
```java
class ForwardRef {
    // A simple-name forward reference (b + 1) is rejected by javac, but a
    // qualified reference compiles — and reads b's preparation default, 0.
    static int a = ForwardRef.b + 1;
    static int b = 5;

    public static void main(String[] args) {
        System.out.println(a);   // prints 1, NOT 6!
        System.out.println(b);   // prints 5
    }
}
```
The assignment to `a` executes while `b` still holds its preparation default of 0, so `a` becomes 1, not 6.
Initialization Deadlock
The JVM uses a per-class initialization lock. If two threads race to initialize classes A and B, where A's <clinit> references B and B's <clinit> references A, a circular initialization deadlock occurs. The JVM specification (§5.5 step 3) documents this as a potential deadlock — the JVM does not detect or prevent it.
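The single-threaded version of the same cycle does not deadlock — the JVM lets a thread re-enter a class it is already initializing — but it silently exposes default values. A runnable sketch (class names illustrative); run the two initializations from two racing threads instead and this same cycle can deadlock.

```java
class CycleA {
    static int value = CycleB.value + 1;  // triggers CycleB's <clinit>
}

class CycleB {
    // CycleA is "in progress" by this same thread, so the JVM does NOT block here;
    // it reads CycleA.value while it still holds its preparation default, 0.
    static int value = CycleA.value + 1;
}

public class CircularInit {
    public static void main(String[] args) {
        // Touching CycleA first: CycleB.value = 0 + 1 = 1, then CycleA.value = 1 + 1 = 2
        System.out.println(CycleA.value + " " + CycleB.value);  // prints: 2 1
    }
}
```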
JVM Bytecode Engine — Stack-Based Execution
The JVM is a stack-based virtual machine. Unlike register-based VMs (like Dalvik/ART), all bytecode instructions operate on an operand stack within a stack frame. Understanding this model is essential for reasoning about bytecode correctness and JIT optimization opportunities.
Stack Frame Structure
Bytecode Example — Arithmetic
```java
// Java source
public static int compute(int a, int b) {
    int c = a + b;
    return c * 2;
}
```

```
// Bytecode (javap -c)
0: iload_0    // push local[0] (a)   → stack: [a]
1: iload_1    // push local[1] (b)   → stack: [a, b]
2: iadd       // pop 2, add, push    → stack: [a+b]
3: istore_2   // pop → local[2] (c)  → stack: []
4: iload_2    // push local[2] (c)   → stack: [c]
5: iconst_2   // push constant 2     → stack: [c, 2]
6: imul       // pop 2, multiply     → stack: [c*2]
7: ireturn    // return int on TOS
```
Complete Bytecode Instruction Reference
| Category | Instructions | Description |
|---|---|---|
| Load (stack push) | iload, lload, fload, dload, aload | Push local variable onto stack |
| Store (stack pop) | istore, lstore, fstore, dstore, astore | Pop stack top to local variable |
| Constants | iconst_0..5, lconst, fconst, dconst, aconst_null, bipush, sipush, ldc | Push literal/constant pool values |
| Arithmetic | iadd, isub, imul, idiv, irem, ineg, ishl, ishr, iushr, iand, ior, ixor | Integer arithmetic/bitwise |
| Long/Float/Double | ladd, fadd, dadd, ... | 64-bit and floating-point ops |
| Conversion | i2l, i2f, i2d, l2i, f2i, d2i, i2b, i2c, i2s | Primitive type widening/narrowing |
| Object creation | new, newarray, anewarray, multianewarray | Heap allocation |
| Field access | getfield, putfield, getstatic, putstatic | Instance and static field I/O |
| Method invocation | invokevirtual, invokeinterface, invokestatic, invokespecial, invokedynamic | Method dispatch |
| Control flow | if_icmpeq, if_icmplt, goto, tableswitch, lookupswitch | Branches and jumps |
| Return | ireturn, lreturn, freturn, dreturn, areturn, return | Method return (plain return for void) |
| Stack ops | pop, pop2, dup, dup2, swap | Stack manipulation |
| Monitors | monitorenter, monitorexit | Synchronized block enter/exit |
Method Invocation Types — Critical Distinction
| Instruction | Use Case | Dispatch Mechanism |
|---|---|---|
| `invokevirtual` | Regular instance method call | vtable lookup (polymorphic) |
| `invokeinterface` | Interface method call | itable lookup (slower than vtable) |
| `invokestatic` | Static method call | Direct (no dispatch needed) |
| `invokespecial` | `super()`, `this()`, private methods, `<init>` | Direct reference, no vtable |
| `invokedynamic` | Lambdas, method handles, Groovy/Kotlin | Bootstrap method + call site |
Template Interpreter — How Bytecode Dispatches
HotSpot's template interpreter does NOT use a switch/case bytecode loop. Instead, at JVM startup, it pre-generates assembly stubs for every bytecode. The dispatch table is a fixed array of code pointers. At the end of each bytecode stub, the interpreter reads the next opcode byte and jumps directly to the next stub — this is called threaded dispatch.
```
// Conceptual dispatcher (the real code is assembly generated by templateTable.cpp)
// After each instruction, dispatch to the next:
//   1. Read byte at PC → opcode
//   2. Increment PC
//   3. Jump to dispatch_table[opcode]
// Giving every bytecode its own dispatch branch predicts far better than one
// central switch, and lets the CPU pipeline bytecode execution.
```
Runtime Data Areas — Deep Internals
Complete Memory Map
Metaspace Deep Dive
Metaspace replaced PermGen in Java 8. It lives in native memory (outside the Java heap) and is managed by HotSpot's own chunk-based allocator on address space reserved from the OS via mmap. Key structures stored in Metaspace:
- InstanceKlass: class metadata, field descriptors, method descriptors
- ConstantPool: the class's runtime constant pool (symbolic + resolved references)
- Method: method metadata, bytecode array, exception table
- vtable/itable: virtual dispatch tables for polymorphic method calls
- Annotations, generic signatures: reflection metadata
Since Java 8, static fields live in the java.lang.Class mirror object on the heap, not in Metaspace. This means static fields ARE subject to GC (the Class object must remain reachable).
Thread Stack — Frame Layout
Each method invocation creates a stack frame. The JVM spec defines the frame's logical contents: local variable table, operand stack, frame data. In HotSpot's template interpreter, additional hidden slots are added for interpreter state (method pointer, bytecode pointer, last SP saved for deoptimization).
| Memory Area | Thread-local? | GC-managed? | Overflow Error |
|---|---|---|---|
| Heap | ❌ Shared | ✅ Yes | OutOfMemoryError |
| Metaspace | ❌ Shared | ⚠️ On CL GC | OutOfMemoryError: Metaspace |
| Thread Stack | ✅ Per-thread | ❌ No | StackOverflowError |
| PC Register | ✅ Per-thread | ❌ No | N/A |
| Native Stack | ✅ Per-thread | ❌ No | Native stack overflow (SIGSEGV) |
| Code Cache | ❌ Shared | ❌ No | "CodeCache is full" warning; JIT disabled |
HotSpot Object Memory Layout
Every Java object on the heap has a precise binary layout defined by HotSpot. Understanding this layout is essential for memory analysis, GC tuning, and off-heap programming with tools like Unsafe or JEP 454 (Foreign Memory API).
Object Layout Diagram
Mark Word States
| State | Bits Layout | When Active |
|---|---|---|
| Unlocked | [unused (25)][identity hash (31)][unused (1)][age (4)][0][01] | Normal unlocked object (hash bits zero until first requested) |
| Biased | [thread ID (54)][epoch (2)][unused (1)][age (4)][1][01] | Lock biased toward a thread (disabled by default in Java 15, removed in Java 18) |
| Lightweight | [ptr to lock record (62)][00] | Thread holds lock, uncontended |
| Heavyweight | [ptr to ObjectMonitor (62)][10] | Inflated monitor, contended lock |
| GC mark | [forwarding ptr (62)][11] | Object being moved by GC |
Compressed OOPs
On 64-bit JVMs with heap < 32GB, -XX:+UseCompressedOops (default on) stores object references as 32-bit values. The JVM transparently scales these compressed pointers by a factor of 8 (since all objects are 8-byte aligned), effectively addressing 32GB with 32 bits. This reduces memory footprint by ~30–40% compared to uncompressed 64-bit pointers.
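The encode/decode arithmetic can be sketched in plain Java. This assumes zero-based compressed oops (heap base at address 0) for simplicity; the class is illustrative, not a JVM API.

```java
public class CompressedOopsSketch {
    // With 8-byte object alignment, the low 3 address bits are always zero,
    // so a 64-bit address compresses to (address >>> 3) in 32 bits.
    static int encode(long address) {
        return (int) (address >>> 3);
    }

    // Decode scales the 32-bit value back up: 2^32 slots * 8-byte stride = 32 GB reach.
    static long decode(int narrow) {
        return (narrow & 0xFFFF_FFFFL) << 3;
    }

    public static void main(String[] args) {
        long addr = 0x7_FFFF_FFF8L;  // an 8-byte-aligned address just under 32 GB
        assert decode(encode(addr)) == addr;  // round-trips because addr is 8-aligned
    }
}
```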
```java
// Object size calculation example
class Example {
    int a;       // 4 bytes
    long b;      // 8 bytes
    byte c;      // 1 byte
    Object ref;  // 4 bytes (compressed oop)
}

// Layout (HotSpot field reordering):
// +0:  mark word      8 bytes
// +8:  klass pointer  4 bytes (compressed)
// +12: int a          4 bytes  ← a 4-byte field fills the gap after the klass pointer
// +16: long b         8 bytes  ← 8-byte fields need 8-byte alignment
// +24: byte c         1 byte
// +25: (padding)      3 bytes
// +28: ref (oop)      4 bytes (compressed)
// Total: 32 bytes
//
// Use JOL (Java Object Layout) to inspect:
// System.out.println(ClassLayout.parseClass(Example.class).toPrintable());
```
Identity Hash Code in Mark Word
The first call to System.identityHashCode(obj) (or obj.hashCode() if not overridden) causes the JVM to compute a hash and store it in the mark word. This is a one-time, lazy computation. The hash value is permanently embedded in the object header — once set, the object can never be biased-locked (the bits are occupied).
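The laziness is invisible from Java, but the stability is easy to check — once computed, the mark-word hash bits never change for the lifetime of the object:

```java
public class IdentityHashDemo {
    public static void main(String[] args) {
        Object o = new Object();
        int h1 = System.identityHashCode(o);  // first call: JVM computes and stores the hash
        int h2 = System.identityHashCode(o);  // later calls: just read the stored mark-word bits
        assert h1 == h2;
        // Object.hashCode() (not overridden here) returns the same identity hash
        assert o.hashCode() == h1;
    }
}
```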
Static vs Instance Variables — Memory Diagrams
```java
class Test {
    static int a = 42;         // class-level, stored in the Class object on the heap
    static String b = "hello"; // reference in the Class object; the String lives in the string pool
    int x;                     // per-instance, inside each object's heap allocation
    String name;               // per-instance reference
}
```
Key Rules
- Static variables live in the `java.lang.Class` mirror object on the heap (since Java 8). They remain reachable only as long as the Class object is reachable.
- Instance variables live inside each object allocation on the heap — zero-initialized when allocated, then set by constructors/initializers.
- Local variables (primitives) live in stack frame's local variable table — they are NOT initialized by default (must be explicitly assigned before use, enforced by the verifier).
- Local reference variables — the reference (pointer) is on the stack, but the object it points to is always on the heap.
| Variable Type | Where Stored | Default Value | GC Root? |
|---|---|---|---|
| Static primitive | Class object (heap) | 0 / false | Via Class object |
| Static reference | Class object (heap) | null | Via Class object |
| Instance primitive | Object body (heap) | 0 / false | Via enclosing object |
| Instance reference | Object body (heap) | null | Via enclosing object |
| Local primitive | Stack frame LVT | Undefined (error) | No (stack-scoped) |
| Local reference | Stack frame LVT | Undefined (error) | Yes (GC scans stacks) |
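The default-value rows can be checked directly (a minimal sketch; names illustrative): statics and instance fields arrive zeroed, while an unassigned local is a compile-time error at its first use.

```java
public class DefaultsDemo {
    static int staticInt;     // set to 0 during preparation
    static String staticRef;  // set to null during preparation
    int instanceInt;          // zeroed when the object is allocated

    public static void main(String[] args) {
        assert staticInt == 0 && staticRef == null;
        assert new DefaultsDemo().instanceInt == 0;

        int local;            // no default value exists for locals
        // System.out.println(local);  // javac: "variable local might not be initialized"
        local = 7;            // must assign before use — enforced by javac and the verifier
        assert local == 7;
    }
}
```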
Heap Architecture Internals
Generational Heap Layout
Object Allocation Fast Path
Most allocations use the bump-pointer allocator in the thread's TLAB (Thread-Local Allocation Buffer). This is an O(1) operation: increment a pointer, check against the TLAB limit, done. No locking required.
```cpp
// Pseudocode for the fast path (the real version is assembly in templateTable.cpp)
oop fast_allocate(size_t size) {
    oop result = tlab_top;              // current top of TLAB
    if (tlab_top + size <= tlab_end) {  // fits in TLAB?
        tlab_top += size;               // bump pointer
        memset(result, 0, size);        // zero-initialize
        return result;                  // done — no lock!
    }
    // TLAB exhausted: slow path (refill or allocate directly in Eden)
    return slow_allocate(size);
}
```
Promotion Rules
| Condition | What Happens |
|---|---|
| Object age >= MaxTenuringThreshold | Promoted to Old Gen |
| Survivor space >= 50% full (TargetSurvivorRatio) | Dynamic threshold lowering; some objects promoted early |
| Object too large for Eden (Humongous threshold) | G1: Allocated directly into Humongous region; CMS/Parallel: allocated in Old Gen |
| Old Gen GC pressure (Concurrent Mode Failure) | Fall-back Full GC (stop-the-world compaction) |
Thread-Local Allocation Buffers (TLAB)
TLABs are a critical performance optimization that allows threads to allocate memory without any synchronization. Each Java thread owns a private region of Eden space called a TLAB.
TLAB Sizing and Refill
TLAB size is adaptive. HotSpot tracks allocation rate per thread and adjusts TLAB size to balance between too-frequent refills (overhead) and too-large TLABs (waste/fragmentation). The target is approximately one TLAB refill per GC pause interval.
When a TLAB cannot satisfy an allocation, the JVM compares the TLAB's remaining space against a refill-waste limit (derived from -XX:TLABWasteTargetPercent). If the remainder is still large, the TLAB is kept and that one object is allocated directly in Eden. If the remainder is small, the JVM fills it with a "filler" object (so GC can still walk the heap linearly) and allocates a fresh TLAB from Eden using a CAS (compare-and-swap) on Eden's shared top pointer.
PLAB — Promotion-Local Allocation Buffers
During GC, surviving objects are copied to the Survivor or Old space. To avoid per-object CAS during copying, GC threads use PLABs — private buffers in the destination space. Multiple GC threads can work in parallel without contention. PLAB sizes are also adaptive.
| Flag | Default | Effect |
|---|---|---|
| `-XX:TLABSize` | Adaptive (~2KB–1MB) | Initial TLAB size per thread |
| `-XX:+ResizeTLAB` | `true` | Enable adaptive TLAB resizing |
| `-XX:TLABWasteTargetPercent` | 1% | Max Eden waste from unfilled TLABs |
| `-XX:+PrintTLAB` | `false` | Print TLAB statistics per GC |
Garbage Collection Algorithms — Deep Internals
Generational Hypothesis
The fundamental insight driving all generational collectors: most objects die young. Empirically, 80-98% of objects in typical Java workloads are unreachable after a single minor GC. This enables short, inexpensive minor GCs that reclaim most garbage without scanning long-lived objects.
Serial GC
Single-threaded stop-the-world collector. Uses mark-compact for Old Gen. Suitable for single-core machines and small heaps (<100MB). Activated with -XX:+UseSerialGC.
Parallel GC (Throughput Collector)
Multi-threaded stop-the-world. Uses parallel copying for Young Gen, parallel mark-compact for Old Gen. Optimizes for maximum throughput. Default in Java 8. Activated with -XX:+UseParallelGC. Key parameter: -XX:MaxGCPauseMillis (soft target).
G1 GC — Garbage First
Default collector since Java 9. Divides the heap into equal-sized regions (1–32MB, chosen at startup based on heap size). There are no fixed Young/Old partitions — regions are tagged as Eden, Survivor, or Old dynamically. Concurrent marking runs alongside application threads, and evacuation then targets the regions containing the most garbage first — hence "Garbage First".
ZGC — Z Garbage Collector
Low-latency collector targeting sub-millisecond pauses regardless of heap size (tested to 16TB). Uses colored pointers (load barriers) and region-based layout. Almost entirely concurrent — mark, relocate, and remap phases run concurrently with application threads. Stop-the-world pauses are <1ms.
Shenandoah GC
Also targeting low latency. Similar to ZGC in goals. Uses Brooks forwarding pointers — each object has an extra header field pointing to its current location. During concurrent compaction, the old copy's forwarding pointer redirects reads/writes to the new copy, enabling concurrent evacuation.
CMS (Concurrent Mark Sweep) — Deprecated
CMS was the first concurrent collector in HotSpot. It performs concurrent mark (with app threads running), then stop-the-world remark, then concurrent sweep. Does NOT compact — leads to heap fragmentation over time. Removed in Java 14 (-XX:+UseConcMarkSweepGC throws error).
GC Algorithm Comparison
| Collector | Pause Model | Throughput | Latency | Best Use Case |
|---|---|---|---|---|
| Serial | STW all phases | Low | High | Single-core, embedded, tiny heaps |
| Parallel | STW all phases (parallel) | High | Medium | Batch processing, throughput priority |
| G1 | Mostly concurrent + short STW | Good | Low | General purpose, >4GB heap |
| ZGC | Sub-ms STW | Good | Ultra-low | Latency-critical, huge heaps |
| Shenandoah | Sub-ms STW | Good | Ultra-low | Latency-critical, Red Hat systems |
GC Barriers — Write and Read Barriers
GC barriers are small code snippets injected by the JIT compiler (and interpreter) around heap read/write operations. They allow concurrent GC phases to maintain invariants without stopping the world.
Write Barrier — Card Marking
When an old-generation object's field is updated to point to a young-generation object, GC must know about it (otherwise the young-gen object could be collected because the old-gen reference isn't scanned during minor GC). The card table tracks which old-gen memory regions contain pointers to young-gen objects.
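A card-table barrier reduces to a shift and a byte store. The sketch below models it in Java — the 512-byte card size matches HotSpot's default, but the class itself is illustrative, not HotSpot API.

```java
import java.util.Arrays;

public class CardTableSketch {
    static final int CARD_SHIFT = 9;  // 2^9 = 512-byte cards (HotSpot's default)
    static final byte CLEAN = 1;
    static final byte DIRTY = 0;      // HotSpot also uses 0 for "dirty"

    final byte[] cards;
    final long heapBase;

    CardTableSketch(long heapBase, long heapSize) {
        this.heapBase = heapBase;
        this.cards = new byte[(int) (heapSize >>> CARD_SHIFT)];
        Arrays.fill(cards, CLEAN);
    }

    // Post-write barrier: after storing a reference into the field at fieldAddr,
    // dirty the card covering it. Minor GC then scans only dirty old-gen cards
    // for old→young pointers instead of the whole old generation.
    void postWriteBarrier(long fieldAddr) {
        cards[(int) ((fieldAddr - heapBase) >>> CARD_SHIFT)] = DIRTY;
    }
}
```

A write at heap offset 600 dirties card 1 (600 >> 9 = 1) and leaves card 0 clean.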
SATB Barrier — G1 / Shenandoah
Snapshot-At-The-Beginning (SATB): G1's concurrent marker takes a conceptual "snapshot" of the live object graph at the start of concurrent marking. As the application mutates the heap, any reference that is overwritten must be logged (to preserve the snapshot invariant). The write barrier logs old reference values to SATB queues.
```cpp
// SATB write barrier pseudocode (G1)
void satb_write_barrier(oop* field_addr) {
    oop old_value = *field_addr;
    if (marking_active && old_value != null) {
        // Log old value to SATB queue (to be rescanned)
        satb_queue.enqueue(old_value);
    }
    // Actual field write proceeds normally
}
```
Load Barrier — ZGC
ZGC uses load barriers on every reference load from the heap. The barrier checks colored pointer metadata bits to determine if the referenced object needs to be relocated. If yes, the barrier updates the reference in-place to point to the new location. This enables concurrent relocation without stopping the world.
```cpp
// ZGC load barrier pseudocode
oop load_barrier(oop* addr) {
    oop ref = *addr;
    if (ref.metadata_bits & BAD_COLOR_BITS) {
        // Object needs remapping or relocation
        ref = slow_path_fixup(addr, ref);
    }
    return ref;  // guaranteed to be in correct location
}
```
Performance Impact of Barriers
Barriers add overhead to every heap read or write. ZGC's load barrier adds ~4ns overhead per reference load. G1's write barrier (card marking + SATB) adds ~2-5ns per write. JIT compilers inline barriers and apply optimizations to eliminate redundant barrier checks.
Safepoints — Stopping the World
A safepoint is a point in execution where all JVM threads are in a known, consistent state — enabling the JVM to perform operations that require exclusive access to the heap (GC, deoptimization, stack scanning, class redefinition).
How Safepoints Work
Safepoint Polling Mechanism
Modern HotSpot uses a dedicated polling page in virtual memory. Under normal execution, this page is readable (poll = read from page = no-op). When a safepoint is needed, the JVM makes this page inaccessible using mprotect(). Threads reading the page receive a SIGSEGV, which the JVM signal handler converts into a safepoint block.
```
;; JIT-generated safepoint poll (x86-64, conceptual),
;; emitted at method returns and loop back-edges:
test   eax, DWORD PTR [poll_page]   ; read from the polling page
;; If the page is readable: effectively a no-op, execution continues
;; If the page is protected: SIGSEGV → JVM signal handler blocks the thread at a safepoint
```
Deoptimization at Safepoints
When a JIT optimization is invalidated (e.g., an inlined virtual call's target set changes because a new class was loaded), the JVM schedules a deoptimization. At the next safepoint, compiled frames are rewritten into equivalent interpreter frames, and execution resumes in the interpreter at the exact bytecode position where the deopt occurred. (The reverse transition — jumping from the interpreter into compiled code in the middle of a hot loop — is on-stack replacement, OSR.)
Time-to-Safepoint Latency
A critical GC performance metric is "time to safepoint" — the time from requesting a safepoint to all threads reaching it. Causes of long TTSP: tight loops with no safepoint polls (rare in modern HotSpot), JNI-heavy workloads, or very long-running native calls. Monitor with -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=200.
JIT Compiler Architecture — C1 and C2
HotSpot has two JIT compilers that cooperate via tiered compilation. Understanding their compilation pipelines is essential for performance tuning and debugging unexpected slowdowns.
Tiered Compilation Levels
| Tier | Executor | Optimization Level | Profiling? |
|---|---|---|---|
| 0 | Template Interpreter | None | Method invocation + back-edge counters |
| 1 | C1 | Simple (no profiling) | No |
| 2 | C1 | Limited | Invocation + back-edge counters only |
| 3 | C1 | Full C1 | Full profiling: branch stats, type profiles, call target profiles |
| 4 | C2 | Maximum | No (uses Tier 3 profile data) |
C1 Compiler Pipeline
C2 Compiler Pipeline
Compilation Thresholds
```
// With TieredCompilation (default since Java 8)
// Interpreter → C1 Tier 3 when:
//   invocation_count > CompileThreshold * InterpreterProfilePercentage / 100
//   back_edge_count  > OnStackReplacePercentage * CompileThreshold / 100
//   (defaults: CompileThreshold=10000, InterpreterProfilePercentage=33)
//
// C1 Tier 3 → C2 Tier 4 when:
//   C1-profiled invocation count exceeds its threshold
//   (roughly 15,000 invocations total)
//
// To force immediate C2 compilation (testing only):
//   -XX:CompileThreshold=1 -XX:-TieredCompilation
```
Inline Cache and Megamorphic Call Sites
For invokevirtual, a call site that has only ever seen one receiver type is monomorphic — the JIT installs an inline cache for that single target and can inline it. If a second type appears, the site becomes bimorphic (a two-way type check and branch). With three or more observed types it becomes megamorphic — the JIT gives up on inlining and falls back to full vtable dispatch. Megamorphic call sites are a significant optimization barrier; keeping hot call sites monomorphic is critical for performance-sensitive code.
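A sketch of the distinction (types are illustrative): the single `s.area()` call site below is monomorphic when fed one concrete type and megamorphic when fed three or more, even though the source code is identical.

```java
interface Shape { double area(); }

final class Circle implements Shape {
    final double r;
    Circle(double r) { this.r = r; }
    public double area() { return Math.PI * r * r; }
}

final class Square implements Shape {
    final double s;
    Square(double s) { this.s = s; }
    public double area() { return s * s; }
}

public class CallSiteDemo {
    // s.area() is ONE call site. Fed only Squares it stays monomorphic and the
    // JIT inlines Square.area(); fed three or more implementing types it goes
    // megamorphic and every call pays an itable dispatch instead.
    static double sumAreas(Shape[] shapes) {
        double sum = 0;
        for (Shape s : shapes) sum += s.area();
        return sum;
    }

    public static void main(String[] args) {
        Shape[] monomorphic = { new Square(2), new Square(3) };
        assert sumAreas(monomorphic) == 13.0;  // 4 + 9
    }
}
```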
JIT Optimizations — Deep Dive
Method Inlining
The most impactful JIT optimization. When a callee method is inlined, the call overhead disappears and the combined code graph enables further optimizations (constant folding, dead code elimination, etc.).
```java
// Before inlining
int result = obj.getValue();              // method call overhead
int getValue() { return this.value; }

// After inlining (in C2's IR)
int result = obj.value;                   // direct field access, no call
```
C2 inlines methods up to -XX:MaxInlineSize (default 35 bytecodes) and -XX:InlineSmallCode (default 1000 bytes of compiled code). Recursive inlining is bounded by -XX:MaxInlineLevel (default 9 levels).
Dead Code Elimination (DCE)
C2's sea-of-nodes IR naturally eliminates unreachable nodes. If a condition is provably always true/false (from profiling), the dead branch is eliminated. Example: after type-checking inlining, null checks on provably non-null references are eliminated.
Constant Folding and Propagation
```java
final int X = 10;
int y = X * 3;   // folded to: int y = 30; at compile time
if (y > 20)      // folded to: if (true) → dead else-branch eliminated
```
Loop Unrolling
Short loops with small fixed iteration counts are unrolled — the loop body is repeated N times with the loop control reduced or eliminated. This reduces branch prediction pressure and enables SIMD vectorization.
```java
// Source
for (int i = 0; i < 4; i++) arr[i] = i * 2;

// After loop unrolling (4x):
arr[0] = 0; arr[1] = 2; arr[2] = 4; arr[3] = 6;
// No loop overhead at all!
```
Null Check Elimination
After a null check in one code path, C2 tracks that the reference is non-null on the path where the check passed, eliminating subsequent redundant null checks. This is a form of flow-sensitive type refinement.
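A sketch of the pattern (the method name is invented): one explicit check dominates all later uses, so the implicit null checks guarding the subsequent calls can be dropped:

```java
public class NullCheckDemo {
    static int describe(String s) {
        if (s == null) {
            return -1;     // explicit check: s is proven null on this path
        }
        // On this path C2 knows s != null, so the implicit null checks
        // guarding the two member calls below are redundant and eliminated.
        return s.length() + s.hashCode() % 7;
    }

    public static void main(String[] args) {
        System.out.println(describe(null));    // -1
        System.out.println(describe("jvm"));   // length plus a hash-derived term
    }
}
```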
Range Check Elimination (RCE)
For array accesses inside loops with provably bounded indices, C2 hoists the range check outside the loop (checked once before the loop starts, not on every iteration):
```java
// Before RCE:
for (int i = 0; i < arr.length; i++) {
    // implicit: if (i < 0 || i >= arr.length) throw AIOOBE
    sum += arr[i];
}

// After RCE: check hoisted, loop body is bounds-check-free
if (arr.length > 0) {
    for (int i = 0; i < arr.length; i++) {
        sum += arr[i];   // no bounds check!
    }
}
```
Escape Analysis — Stack Allocation and Scalar Replacement
Escape analysis determines whether an object allocated in a method can "escape" that method (i.e., be referenced from outside). Objects that do not escape can be subject to powerful optimizations.
Escape States
| State | Meaning | Optimization Possible |
|---|---|---|
| NoEscape | Object only used within the allocating method; never passed to another method or stored in a field | Scalar replacement + Stack allocation |
| ArgEscape | Object passed to methods but those methods don't store it globally | Lock elimination |
| GlobalEscape | Object stored in static field, returned from method, or passed to native | None — must heap allocate |
Scalar Replacement
When an object is NoEscape, C2 can decompose it into its individual fields as scalar values (removing the object entirely). This eliminates heap allocation, reduces GC pressure, and allows the scalar values to live in CPU registers.
```java
class Point { int x, y; }

int sumCoords() {
    Point p = new Point();   // NoEscape: p never leaves this method
    p.x = 3;
    p.y = 4;
    return p.x + p.y;
}

// After scalar replacement — NO heap allocation:
int sumCoords() {
    int px = 3;         // scalar: p.x
    int py = 4;         // scalar: p.y
    return px + py;     // constant-folded to: return 7
}
```
Lock Elimination
If an object is NoEscape or ArgEscape, its monitor lock can never be contended — no other thread can see it. C2 eliminates synchronized blocks on such objects entirely:
```java
synchronized (new Object()) {   // lock on NoEscape object
    doWork();                   // synchronized block eliminated by JIT!
}

// Also applies to StringBuffer (synchronized) operations
// if the StringBuffer doesn't escape the method:
StringBuffer sb = new StringBuffer();
sb.append("a").append("b");     // locks eliminated — sb is NoEscape
```
Escape Analysis Limits
Escape analysis in C2 is interprocedural within the inlining budget. Objects that escape through methods not inlined (too large, recursive, megamorphic) are not candidates. JVM flags:
```
-XX:+DoEscapeAnalysis      (default: true in Java 8+)
-XX:+EliminateAllocations  (default: true — enable scalar replacement)
-XX:+EliminateLocks        (default: true — enable lock elimination)
-XX:+PrintEscapeAnalysis   (debug: print EA results)
```
Vectorization — SIMD Optimizations
Modern CPUs support SIMD (Single Instruction, Multiple Data) instructions — SSE2/SSE4, AVX/AVX2/AVX-512 on x86, NEON on AArch64. These instructions operate on 128–512 bit wide registers, processing 4–16 integers (or 2–8 doubles) per instruction. C2 auto-vectorizes certain loop patterns.
Auto-Vectorization Conditions
For C2 to vectorize a loop:
- Loop must have a countable iteration count
- No loop-carried dependencies (iterations must be independent)
- Array accesses must be sequential (stride-1)
- Operations must be vectorizable (arithmetic, comparisons, simple logical)
```java
// Vectorizable loop:
int[] a = ..., b = ..., c = ...;
for (int i = 0; i < n; i++) {
    c[i] = a[i] + b[i];     // independent iterations — vectorized!
}

// C2 may emit (AVX2, processes 8 ints at a time):
//   vmovdqu ymm0, [a + i*4]    ; load 8 ints from a
//   vmovdqu ymm1, [b + i*4]    ; load 8 ints from b
//   vpaddd  ymm2, ymm0, ymm1   ; add 8 ints in parallel
//   vmovdqu [c + i*4], ymm2    ; store 8 ints to c

// NOT vectorizable (loop-carried dependency):
for (int i = 1; i < n; i++) {
    a[i] = a[i-1] + b[i];      // depends on previous iteration
}
```
Panama Vector API (incubating since Java 16)
The Vector API (jdk.incubator.vector) provides explicit SIMD programming in Java. Unlike best-effort auto-vectorization, Vector API operations compile reliably to hardware vector instructions on supported CPUs, with a graceful scalar fallback elsewhere:
```java
import jdk.incubator.vector.*;

static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256;   // 256-bit = 8 ints

void addArrays(int[] a, int[] b, int[] c) {
    int i = 0;
    int upper = SPECIES.loopBound(a.length);        // largest multiple of the lane count
    for (; i < upper; i += SPECIES.length()) {
        IntVector va = IntVector.fromArray(SPECIES, a, i);
        IntVector vb = IntVector.fromArray(SPECIES, b, i);
        va.add(vb).intoArray(c, i);                 // SIMD add
    }
    for (; i < a.length; i++) {                     // scalar tail for the remainder
        c[i] = a[i] + b[i];
    }
}
```
Java Memory Model — Happens-Before and Visibility
The Java Memory Model (JMM), specified in JSR-133, defines the semantics of multithreaded Java programs. Without the JMM, compilers and CPUs can reorder memory operations in ways that produce counterintuitive results.
Memory Visibility Problem
```java
// Without synchronization — BROKEN
int data = 0;
boolean flag = false;

// Thread 1:
data = compute();    // (1)
flag = true;         // (2)

// Thread 2:
while (!flag) { }    // spin (3)
use(data);           // (4) MAY SEE STALE DATA!

// Thread 2 could see flag=true but data still 0:
//  - the compiler/CPU can reorder (1) and (2)
//  - the CPU's store buffer may not flush to cache in order
```
Happens-Before Rules
Action A happens-before action B means: A's effects are guaranteed to be visible to B. The JMM defines the following happens-before edges:
| Rule | HB Edge |
|---|---|
| Program order | Within a single thread: A before B in source → A HB B |
| Monitor lock | unlock(m) HB every subsequent lock(m) on the same monitor |
| Volatile write | volatile write to v HB every subsequent volatile read of v |
| Thread start | Thread.start() HB all actions in the started thread |
| Thread join | All actions in T HB T.join() returning |
| Object finalization | End of constructor HB start of finalizer |
| Transitivity | If A HB B and B HB C, then A HB C |
Volatile Semantics
A volatile field provides two guarantees: visibility (every read sees the last write) and ordering (no reordering across a volatile access). The JIT must not cache volatile variables in registers and must emit memory fences.
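A minimal runnable sketch of the publication guarantee (class and field names are illustrative): a plain field is safely published through a volatile flag, because the volatile write happens-before the volatile read that observes it.

```java
public class VolatileFlag {
    static int data = 0;                    // plain field
    static volatile boolean flag = false;   // publication flag
    static volatile int observed = -1;      // what the reader saw

    public static void main(String[] args) throws InterruptedException {
        Thread writer = new Thread(() -> {
            data = 42;       // (1) plain write
            flag = true;     // (2) volatile write — publishes (1)
        });
        Thread reader = new Thread(() -> {
            while (!flag) { Thread.onSpinWait(); }   // (3) spin on volatile read
            observed = data;                         // (4) guaranteed to see 42
        });
        reader.start();
        writer.start();
        writer.join();
        reader.join();
        System.out.println(observed);   // 42
    }
}
```

Remove `volatile` from `flag` and the reader may spin forever or observe a stale `data` — the exact failure mode shown in the broken example above.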
Double-Checked Locking — The Classic Trap
```java
// BROKEN before Java 5 / without volatile:
private static Singleton instance;

public static Singleton getInstance() {
    if (instance == null) {
        synchronized (Singleton.class) {
            if (instance == null)
                instance = new Singleton();   // UNSAFE!
        }
    }
    return instance;
}

// Problem: "new Singleton()" is NOT atomic:
//   1. Allocate memory for Singleton
//   2. Call constructor (initialize fields)
//   3. Assign reference to instance
// Steps 2 and 3 CAN be reordered by CPU/compiler!
// Thread B may see a non-null but uninitialized instance.

// CORRECT — volatile prevents reordering:
private static volatile Singleton instance;
// the volatile write in step 3 cannot be reordered with step 2
```
Memory Barrier Types
| Barrier | Prevents | JMM Use Case |
|---|---|---|
| LoadLoad | Load reordering with prior load | Volatile read |
| LoadStore | Store reordering with prior load | Volatile read |
| StoreStore | Store reordering with prior store | Before volatile write |
| StoreLoad | Load reordering with prior store | After volatile write (strongest; prevents all reordering) |
JVM Synchronization Internals
Every Java object is a potential lock (monitor). The JVM implements a three-tier locking strategy to minimize overhead for the common case (no contention).
Lock State Transitions
Lightweight Locking
When a thread enters a synchronized block (and biased locking is not applicable), it performs a CAS to atomically swap the mark word: it stores the original mark word in a lock record on its stack, then sets the mark word to point to this lock record (with lock bits = 00). If the CAS succeeds, the thread owns the lock. On exit, CAS swaps back. This is O(1) and requires no OS kernel involvement.
```
// Lightweight lock entry (monitorenter bytecode), pseudocode:
lock_record.displaced_header = object->mark_word()   // save original
if (CAS(object->mark_word, original, ptr_to_lock_record | LOCKED_BITS)) {
    // CAS succeeded: thread owns the lock
} else if (is_our_lock_record(object->mark_word)) {
    // Recursive lock: increment recursion count
} else {
    // Contention: inflate to heavyweight (ObjectMonitor)
    inflate_and_enter(object);
}
```
ObjectMonitor — Heavyweight Lock Internals
When a lock is inflated, HotSpot allocates an ObjectMonitor C++ object. This contains: the owner thread, an entry list (waiting to acquire), a wait set (threads in Object.wait()), and a recursion counter.
```cpp
// ObjectMonitor key fields (hotspot/src/share/vm/runtime/objectMonitor.hpp)
class ObjectMonitor {
    volatile markWord    _header;       // saved displaced mark word
    volatile JavaThread* _owner;        // owning thread
    volatile intptr_t    _recursions;   // recursion depth
    ObjectWaiter*        _EntryList;    // threads blocked on entry
    ObjectWaiter*        _WaitSet;      // threads in wait()
    volatile int         _waiters;      // count of wait() callers
};
```
Biased Locking Removed (Java 21)
Biased locking was deprecated and disabled by default in Java 15 (JEP 374) and later removed entirely. Modern hardware CAS operations are cheap enough that the optimization's benefit (eliminating the CAS on uncontended lock acquisition) was outweighed by the cost of revoking bias at safepoints when contention did occur.
On older JVMs, -XX:-UseBiasedLocking disabled biased locking to avoid bias-revocation overhead; in Java 21+ the flag no longer exists. Measure lock contention with jstack or JFR monitor-contention events.
String Pool Internals
Java's String class is immutable, making sharing safe. The JVM maintains a String Constant Pool (also called the String intern table) in the heap (since Java 7) backed by a fixed-size hash table.
How String Literals Are Interned
intern() and == vs equals()
```java
String s1 = "Java";                  // interned, pool ref
String s2 = "Java";                  // same pool ref as s1
String s3 = new String("Java");     // new heap object, NOT pool
String s4 = s3.intern();             // returns pool ref

System.out.println(s1 == s2);        // TRUE  — same pool object
System.out.println(s1 == s3);        // FALSE — different objects
System.out.println(s1 == s4);        // TRUE  — s4 is pool ref
System.out.println(s1.equals(s3));   // TRUE  — same content
```
String Concatenation with invokedynamic (Java 9+)
Since Java 9, string concatenation using + uses invokedynamic with StringConcatFactory (rather than StringBuilder as in Java 8). This enables JVM-level optimization of string building strategies at runtime.
```java
// Java source:
String s = name + " has age " + age;

// Java 8 bytecode (old StringBuilder approach):
//   new StringBuilder → append(name) → append(" has age ") → append(age) → toString()

// Java 9+ bytecode (invokedynamic approach):
//   invokedynamic #1 (StringConcatFactory.makeConcatWithConstants)
//   Bootstrap: generates an optimized concatenation strategy at runtime
```
String Deduplication (G1 GC)
With -XX:+UseStringDeduplication (G1 only), the GC deduplicates the backing arrays of Strings with identical content (char[] before Java 9, byte[] with compact strings since). This doesn't affect String object identity — distinct String objects remain distinct — but their backing arrays are shared, saving memory in workloads with many duplicate strings.
Machine Code Generation — From IR to Assembly
C2's backend translates its Ideal graph (sea of nodes) through a series of steps to produce native machine code. Understanding this helps diagnose JIT compilation failures and unexpected performance characteristics.
Register Allocation
C2 uses graph coloring register allocation. Variables (IR nodes) are mapped to machine registers; when there are more live values than registers, the allocator spills to the stack. The register allocation quality directly determines instruction density and spill overhead.
Generated Assembly Example
```
// Java source:
public static long fibonacci(int n) {
    if (n <= 1) return n;
    return fibonacci(n - 1) + fibonacci(n - 2);
}

// C2-generated x86_64 (simplified, after inlining the base case;
// register save/restore around the recursive calls elided):
fibonacci:
    push  rbp                ; prolog: stack frame setup
    mov   rbp, rsp
    cmp   edi, 1             ; check n <= 1
    jle   .base_case
    lea   edi, [rdi - 1]     ; fibonacci(n-1)
    call  fibonacci
    mov   rbx, rax           ; save result
    lea   edi, [rdi - 2]     ; fibonacci(n-2)
    call  fibonacci
    add   rax, rbx           ; sum results
    pop   rbp
    ret
.base_case:
    movsx rax, edi
    pop   rbp
    ret
```
Instruction Scheduling
C2 schedules instructions to hide CPU pipeline latency. For example, a load (3–4 cycle latency) followed by an immediate use creates a pipeline stall. C2 inserts independent instructions between the load and its first use to overlap execution. On out-of-order CPUs (all modern x86/ARM), the CPU also performs hardware instruction scheduling.
PrintAssembly — Viewing JIT Output
# Requires hsdis library for disassembly
java -XX:+PrintCompilation \
-XX:+UnlockDiagnosticVMOptions \
-XX:+PrintAssembly \
-XX:CompileCommand=print,MyClass.myMethod \
MyClass
JVM Performance Engineering
Essential JVM Flags Reference
| Flag | Purpose | Typical Value |
|---|---|---|
| -Xms | Initial heap size | -Xms2g (set equal to -Xmx to avoid resizing) |
| -Xmx | Maximum heap size | -Xmx8g |
| -Xss | Thread stack size | -Xss512k (reduce for many threads) |
| -XX:MetaspaceSize | Initial Metaspace size | -XX:MetaspaceSize=256m |
| -XX:MaxMetaspaceSize | Metaspace ceiling (important!) | -XX:MaxMetaspaceSize=512m |
| -XX:+UseG1GC | Use G1 collector | Default Java 9+ |
| -XX:+UseZGC | Use ZGC collector | Low-latency apps |
| -XX:MaxGCPauseMillis | G1 pause target (soft goal) | -XX:MaxGCPauseMillis=200 |
| -XX:GCTimeRatio | Throughput ratio (1/(1+ratio)) | -XX:GCTimeRatio=9 (10% GC overhead max) |
| -XX:+PrintGCDetails | Verbose GC logging (pre-Java 9) | Use in staging/production |
| -Xlog:gc*:file=/path/gc.log | Unified GC logging (Java 9+) | Always enable in production |
| -XX:ReservedCodeCacheSize | JIT code cache size | -XX:ReservedCodeCacheSize=512m |
| -XX:+TieredCompilation | Enable tiered JIT (default) | Leave enabled |
| -XX:+HeapDumpOnOutOfMemoryError | Dump heap on OOM | Always enable in production |
| -XX:HeapDumpPath=/dumps/ | Where to write heap dump | On fast disk |
GC Tuning Strategy
Step 1: Define your goals — throughput, latency, footprint (pick two).
Step 2: Choose the right collector (G1 for general purpose, ZGC/Shenandoah for latency, Parallel for throughput).
Step 3: Set the heap size (-Xms = -Xmx to avoid heap resizing pauses).
Step 4: Tune GC-specific parameters.
Step 5: Measure with a real workload, and iterate.
Monitoring Tools
| Tool | Command | What It Shows |
|---|---|---|
| jstat | jstat -gcutil <pid> 1000 | GC statistics: heap usage %, GC count, GC time per second |
| jmap | jmap -heap <pid> | Heap configuration and usage; -histo for object histogram |
| jstack | jstack <pid> | Thread dumps: detect deadlocks, hotspots, blocked threads |
| jcmd | jcmd <pid> VM.flags | JVM flags, GC stats, heap info, thread info, JFR control |
| Java Flight Recorder | jcmd <pid> JFR.start duration=60s filename=recording.jfr | CPU, memory, GC, I/O, lock contention profiling with <1% overhead |
| JDK Mission Control | GUI for JFR analysis | Flame graphs, GC analysis, lock analysis, allocation profiling |
| async-profiler | ./profiler.sh -d 30 -f profile.html <pid> | CPU/allocation/lock profiling via AsyncGetCallTrace (no safepoint bias) |
Finding Memory Leaks
```
# 1. Take heap dump
jcmd <pid> GC.heap_dump /tmp/heap.hprof

# 2. Analyze with Eclipse Memory Analyzer (MAT)
#    - Open the heap dump
#    - Run "Leak Suspects Report"
#    - Inspect the dominator tree — largest retained heaps
#    - Find ClassLoader leaks via "Class Loader Explorer"

# Common leak patterns:
#   1. Static collections holding object references (static Map, List)
#   2. Listener/callback registrations never deregistered
#   3. ThreadLocal values not removed (especially in thread pools)
#   4. Inner class references to outer class (anonymous listeners)
#   5. ClassLoader leaks in hot-deploy scenarios
```
JVM Failure Analysis
StackOverflowError
Thrown when the JVM thread stack reaches its maximum size (-Xss). Each method call adds a stack frame; deeply recursive methods or infinite recursion exhaust the stack.
```java
// Cause: infinite recursion
void recurse() { recurse(); }   // StackOverflowError after ~1,000–10,000 frames

// Cause: legitimate deep recursion on large inputs
// Fix: convert to iterative + an explicit Deque/Stack data structure

// Diagnosis:
//   jstack shows all frames in the thread that overflowed —
//   look for the repeating frame pattern
```
OutOfMemoryError Variants
| OOM Message | Root Cause | Diagnosis |
|---|---|---|
| Java heap space | Heap exhausted; too many live objects | Heap dump + MAT; check -Xmx; find leaks |
| GC overhead limit exceeded | JVM spending >98% time in GC reclaiming <2% of heap | Increase heap; find allocation hotspots with JFR |
| Metaspace | Metaspace exhausted; too many loaded classes | jcmd <pid> VM.classloaders; check for ClassLoader leaks |
| unable to create new native thread | OS thread limit or process memory exhausted | Reduce -Xss; reduce thread count; check ulimits |
| Direct buffer memory | Off-heap direct ByteBuffer space exhausted | Increase -XX:MaxDirectMemorySize; find unreleased buffers |
| Code Cache | JIT code cache full; JIT compilation disabled | Increase -XX:ReservedCodeCacheSize; look for code cache flushing |
Metaspace Leak — ClassLoader Leak Pattern
```java
// Leak pattern: ClassLoader held alive by a pooled thread
Thread t = new Thread(task);
t.setContextClassLoader(customCL);   // thread holds a reference to the CL
t.start();
// If the thread stays alive in a thread pool, customCL
// (and every class it loaded) is retained.

// Fix: restore the context ClassLoader before returning the thread to the pool
try {
    task.run();
} finally {
    Thread.currentThread().setContextClassLoader(originalCL);
}
```
Diagnosing Long GC Pauses
```
# Enable detailed GC logging
-Xlog:gc+phases*=debug:file=gc.log:time,uptime:filecount=5,filesize=20m

# Common causes of long pauses:
# 1. Huge survivor spaces → large copy overhead
#    Fix: -XX:SurvivorRatio, -XX:MaxTenuringThreshold
# 2. Long time-to-safepoint
#    Monitor with: -Xlog:safepoint (Java 9+) or -XX:+PrintSafepointStatistics
# 3. G1 humongous allocations causing early mixed GCs
#    Fix: increase -XX:G1HeapRegionSize
# 4. Evacuation failure (G1) / concurrent mode failure
#    Fix: increase heap or lower -XX:InitiatingHeapOccupancyPercent
```
JVM Debugging Tools — Complete Reference
jstack — Thread Analysis
```
jstack <pid>              # thread dump to stdout
jstack -l <pid>           # + lock information
kill -3 <pid>             # trigger thread dump via signal
jcmd <pid> Thread.print   # alternative via jcmd

Thread states in jstack output:
  RUNNABLE      - executing or ready to run on CPU
  BLOCKED       - waiting for a monitor lock
  WAITING       - Object.wait(), Thread.join(), LockSupport.park()
  TIMED_WAITING - same but with timeout
  NEW           - not yet started
  TERMINATED    - finished execution
```
jmap — Heap Analysis
```
jmap -heap <pid>         # heap summary (generation sizes)
jmap -histo <pid>        # object histogram (class → count, bytes)
jmap -histo:live <pid>   # force GC first, then histogram of live objects only
jmap -dump:live,format=b,file=h.hprof <pid>   # heap dump
```
jcmd — Swiss Army Knife
```
jcmd <pid> help                        # list all available commands
jcmd <pid> VM.flags                    # all JVM flags (including defaults)
jcmd <pid> VM.system_properties        # system properties
jcmd <pid> VM.version                  # JVM version info
jcmd <pid> GC.run                      # trigger GC (hint)
jcmd <pid> GC.heap_info                # heap usage
jcmd <pid> GC.heap_dump /tmp/h.hprof   # heap dump
jcmd <pid> Thread.print                # thread dump
jcmd <pid> VM.classloaders             # ClassLoader hierarchy
jcmd <pid> Compiler.queue              # JIT compilation queue
jcmd <pid> Compiler.codecache          # code cache usage

# Java Flight Recorder
jcmd <pid> JFR.start name=myRec duration=60s filename=rec.jfr
jcmd <pid> JFR.dump name=myRec filename=rec.jfr
jcmd <pid> JFR.stop name=myRec
```
Java Flight Recorder (JFR)
JFR is a production-safe, low-overhead profiler built into the JVM. It records time-series data about JVM and application events: CPU usage, garbage collection, JIT compilation, I/O, lock acquisition, memory allocation (with stack traces), exceptions, and custom application events. Overhead is <1% CPU in most workloads.
| JFR Event Category | What It Reveals |
|---|---|
| GC events | Pause times, GC causes, before/after heap sizes, phase breakdown |
| JIT compilation | Method compilation times, code cache usage, deoptimization events |
| Allocation profiling | Top allocating methods and classes (with stack traces) |
| Lock contention | Locks with highest contention, average wait time, waiters |
| CPU/Method profiling | Hot methods consuming CPU (async sampling) |
| I/O profiling | File and network read/write latency breakdown |
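Recordings can also be driven programmatically through the jdk.jfr API (Java 11+). A minimal sketch — jdk.GarbageCollection is a built-in JFR event name, and the output filename is illustrative:

```java
import jdk.jfr.Recording;
import java.nio.file.Path;

public class JfrDemo {
    public static void main(String[] args) throws Exception {
        try (Recording recording = new Recording()) {
            recording.enable("jdk.GarbageCollection");   // built-in GC event
            recording.start();

            // Allocate some garbage so there is activity to record
            byte[][] garbage = new byte[100][];
            for (int i = 0; i < 100; i++) garbage[i] = new byte[1 << 16];
            System.gc();   // hint: encourage a GC event

            recording.stop();
            Path out = Path.of("recording.jfr");   // illustrative path
            recording.dump(out);                   // write the .jfr file
            System.out.println("wrote " + out);
        }
    }
}
```

The resulting file opens in JDK Mission Control or can be printed with `jfr print recording.jfr`.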
Complete JVM Memory Map
Object Cross-Reference Map
Bytecode to CPU — Full Execution Pipeline
When JIT-compiled code executes on the CPU, it passes through the CPU's own pipeline stages. Understanding CPU execution behavior is essential for the highest-level JVM performance work.
CPU Execution Pipeline (Modern Out-of-Order x86_64)
Branch Prediction and JVM Code
JVM bytecode dispatch in the template interpreter relies heavily on branch prediction. Hot loops in JIT code are predictable (same branch taken every iteration). Virtual dispatch through vtables is predictable for monomorphic call sites. The JIT generates type-check inlining guards that are predicted well by the CPU branch predictor.
Cache Effects on Java Object Graphs
Java's object-per-allocation model creates pointer-heavy data structures. Traversing an array of object references involves a pointer indirection per element — each reference load can miss the L1 cache. Key optimization: use value-type arrays (int[], long[]) or off-heap layouts to enable sequential memory access and CPU cache line prefetching.
```java
// SLOW: pointer indirection, random cache misses
Integer[] boxed = new Integer[1_000_000];
java.util.Arrays.fill(boxed, 1);          // filled so elements are non-null
long sumBoxed = 0;
for (Integer i : boxed) sumBoxed += i;    // each access = potential cache miss

// FAST: sequential memory, prefetched by CPU
int[] primitives = new int[1_000_000];
long sumPrim = 0;
for (int i : primitives) sumPrim += i;    // cache-line friendly, vectorizable
```
JIT Deoptimization and Branch Prediction
When C2 deoptimizes a method (due to class loading invalidating an inlined assumption), control returns to the interpreter at a specific bytecode. Frequent deoptimizations can harm branch predictor state. Monitor deoptimizations with -XX:+PrintCompilation (look for "made not entrant" entries) or JFR's deoptimization events.
Advanced JVM Interview Questions & Traps
Class Initialization Traps
static final int X = 5 is inlined by javac at every use site. Accessing it never loads or initializes the declaring class.
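This is directly observable. In the sketch below (class names invented), reading Consts.X never runs Consts's static initializer, because javac copies the constant value into the caller's bytecode (JLS 12.4.1):

```java
class InitLog {
    static boolean constsInitialized = false;   // tracked outside Consts
}

class Consts {
    static final int X = 5;   // compile-time constant — inlined at use sites
    static { InitLog.constsInitialized = true; }
}

public class ConstantTrapDemo {
    public static void main(String[] args) {
        // javac folds this to the literal 10 — the emitted bytecode
        // contains no reference to Consts at all.
        int y = Consts.X * 2;
        System.out.println(y + ", Consts initialized? " + InitLog.constsInitialized);
        // prints: 10, Consts initialized? false
    }
}
```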
Child.parentStaticField initializes Parent, NOT Child. The JLS says initialization happens on the declaring class.
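A runnable sketch of this rule (class names invented): the getstatic resolves to the field's declaring class, so only Parent is initialized even though the access is written through Child:

```java
class InitTracker {
    static boolean parentInit = false;
    static boolean childInit = false;
}

class Parent {
    static int parentStaticField = 7;   // not final — not a compile-time constant
    static { InitTracker.parentInit = true; }
}

class Child extends Parent {
    static { InitTracker.childInit = true; }
}

public class InitTrapDemo {
    public static void main(String[] args) {
        // Written through Child, but the field is declared in Parent,
        // so only Parent's <clinit> runs (JVMS §5.5).
        int v = Child.parentStaticField;
        System.out.println(v + ", parent=" + InitTracker.parentInit
                             + ", child=" + InitTracker.childInit);
        // prints: 7, parent=true, child=false
    }
}
```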
If <clinit> throws, the class is marked as failed. Every subsequent use of that class throws NoClassDefFoundError (not ExceptionInInitializerError — that is thrown only on the first attempt).
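A runnable sketch of the two-phase failure (class names invented; the `if (true)` wrapper is the standard trick that lets a static initializer throw and still compile):

```java
class Broken {
    static { if (true) throw new RuntimeException("boom"); }
    static int value = 1;
}

public class ClinitFailureDemo {
    static boolean sawEIIE, sawNCDFE;

    public static void main(String[] args) {
        try {
            int v = Broken.value;   // first active use — runs <clinit>, which throws
        } catch (ExceptionInInitializerError e) {
            sawEIIE = true;         // e.getCause() is the original RuntimeException
        }
        try {
            int v = Broken.value;   // class is already marked as failed
        } catch (NoClassDefFoundError e) {
            sawNCDFE = true;
        }
        System.out.println(sawEIIE + " " + sawNCDFE);   // true true
    }
}
```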
String Pool Traps
```java
String s1 = "a" + "b" + "c";       // compile-time: "abc" literal — pooled
String s2 = "abc";                 // same pool object as s1
System.out.println(s1 == s2);      // TRUE

String a = "ab";
String b = a + "c";                // runtime concat — NOT pooled
System.out.println(b == "abc");    // FALSE

final String a2 = "ab";            // compile-time constant
String b2 = a2 + "c";              // compile-time concat → "abc" literal
System.out.println(b2 == "abc");   // TRUE — javac inlines the concat
```
Object Reference vs Object
```java
void modify(Object o) {
    o = new Object();   // does NOT affect the caller's reference
}
// Java passes references BY VALUE — you can't change the caller's variable,
// but you CAN mutate the object the reference points to.
```
JVM Advanced Q&A
| Question | Answer |
|---|---|
| Where are static variables stored in Java 8+? | In the java.lang.Class mirror object, which is on the heap. Not in Metaspace. Not on the stack. |
| Can the JVM collect a class from Metaspace? | Yes, if the ClassLoader that loaded it becomes unreachable. Then all classes it loaded (and their Metaspace data) are freed. |
| Does System.gc() guarantee collection? | No. It's a hint. The JVM may ignore it. Use -XX:+ExplicitGCInvokesConcurrent to make it trigger G1 concurrent cycle. |
| What is the difference between == and equals() for Integer? | Integer caches values -128 to 127 in an IntegerCache. Integer.valueOf(100) == Integer.valueOf(100) is true (cached). Integer.valueOf(200) == Integer.valueOf(200) is false (outside cache range). Always use equals() for Integer comparison. |
| What is a safepoint? | A point in execution where all Java threads are paused and the JVM has exclusive, consistent access to the heap. Required for GC, deoptimization, class redefinition, and stack sampling. |
| What causes megamorphic call site degradation? | More than 2 different concrete types observed at a virtual call site. C2 cannot inline megamorphic sites, and the JIT falls back to vtable dispatch, losing all inlining-dependent optimizations. |
| What is false sharing? | Two threads writing to different variables that share the same CPU cache line (64 bytes). Each write by one thread invalidates the other thread's cache copy, causing constant cache coherency traffic. Mitigate with @Contended padding or cache-line-aligned allocation. |
| What triggers JIT deoptimization? | Class loading events that invalidate CHA (Class Hierarchy Analysis) assumptions (e.g., a new subclass loaded after inlining), null check failures on speculated non-null refs, type profile changes at guarded inlines, and explicit deoptimization for debugging. |
| Why might -Xss affect thread count? | Each thread requires OS virtual memory for its stack (default 512KB–1MB). With 10,000 threads and 1MB stack size, that's 10GB of virtual address space just for stacks. Reducing -Xss allows more threads but risks StackOverflowError in deeply recursive methods. |
| What is on-stack replacement (OSR)? | JIT compilation of a currently-executing method's loop body, replacing the interpreter frame mid-execution with a compiled frame. Allows hot loops discovered at runtime to be compiled without waiting for the method to complete and be re-invoked. |
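The IntegerCache row above can be verified directly:

```java
public class IntegerCacheDemo {
    public static void main(String[] args) {
        Integer a = Integer.valueOf(100), b = Integer.valueOf(100);
        Integer c = Integer.valueOf(200), d = Integer.valueOf(200);
        System.out.println(a == b);       // true  — both are the cached object
        System.out.println(c == d);       // false — outside the -128..127 cache
        System.out.println(c.equals(d));  // true  — always compare by value
    }
}
```

Note that autoboxing (`Integer x = 100;`) compiles to Integer.valueOf, so the same cache behavior applies to boxed literals.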
Lock Ordering and Deadlock
```java
// Classic deadlock pattern
synchronized (lockA) { synchronized (lockB) { } }   // Thread 1: acquires A, then B
synchronized (lockB) { synchronized (lockA) { } }   // Thread 2: acquires B, then A → DEADLOCK

// Detect with: jstack -l <pid> → look for "Found one Java-level deadlock:"

// Fix: always acquire locks in a consistent global order
if (System.identityHashCode(lockA) < System.identityHashCode(lockB)) {
    synchronized (lockA) { synchronized (lockB) { /* work */ } }
} else {
    synchronized (lockB) { synchronized (lockA) { /* work */ } }
}
```
ThreadLocal Memory Leak
```java
static ThreadLocal<HeavyObject> tl = new ThreadLocal<>();

// In a thread pool worker:
tl.set(new HeavyObject());
doWork();
// FORGOT: tl.remove();
// The thread returns to the pool, and the ThreadLocal entry is retained
// for the thread's lifetime — HeavyObject is never GC'd.

// Fix: ALWAYS use try/finally to call tl.remove()
try {
    tl.set(new HeavyObject());
    doWork();
} finally {
    tl.remove();   // mandatory!
}
```